Automatic Extraction of Semi-structured Web Data

Authors

Fang Dong

Mengchi Liu

Yifeng Li

Abstract

As a huge data source, the internet contains a large amount of valuable information, usually in semi-structured form in HTML web pages. To extract web data and organize it with relationships similar to those in the real world, this paper proposes a method for automatic data extraction from the web. Target web pages that contain valuable data are crawled using a combination of keyword matching and database content matching. Data is then extracted from the crawled pages using their HTML structure and visual features. Finally, the extracted data is integrated into the structure of the information network model. Experimental results indicate that the method can be applied to semi-structured data extraction on the web and is useful for extracting and managing semi-structured web data.
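The abstract gives no implementation details, so the following is only a minimal sketch of the first two steps under stated assumptions. Page relevance is approximated by counting keyword hits plus hits on values already stored in a database, and extraction is approximated by reading table rows from the HTML structure. The names `TextAndRows`, `parse_page`, `is_target_page`, and the `threshold` parameter are hypothetical, and visual features are omitted because they would require a rendered layout.

```python
from html.parser import HTMLParser


class TextAndRows(HTMLParser):
    """Collects all text nodes and the cell texts of each <tr> row."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # every text node, used for relevance scoring
        self.rows = []        # one list of cell strings per table row
        self._row = None      # cells of the row currently open
        self._cell = None     # text pieces of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Normalize whitespace inside the cell before storing it.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by broken markup

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._cell is not None:
            self._cell.append(data)


def parse_page(html):
    """Return (lowercased page text, list of table rows)."""
    parser = TextAndRows()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split()).lower()
    return text, parser.rows


def is_target_page(text, keywords, known_values, threshold=2):
    """Assumed rule: keyword hits plus database-value hits must reach threshold."""
    hits = sum(1 for k in keywords if k.lower() in text)
    hits += sum(1 for v in known_values if v.lower() in text)
    return hits >= threshold


if __name__ == "__main__":
    page = (
        "<html><body><h1>Faculty list</h1><table>"
        "<tr><th>Name</th><th>Dept</th></tr>"
        "<tr><td>Ada Lovelace</td><td>Mathematics</td></tr>"
        "</table></body></html>"
    )
    text, rows = parse_page(page)
    if is_target_page(text, ["faculty"], ["Mathematics"]):
        print(rows)
```

A crawler built this way would keep only pages for which `is_target_page` returns true, then pass the extracted rows to whatever step maps them into the target data model.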

Similar resources

JIS 28/2 00 prelims

Ontology is an important emerging discipline that has the huge potential to improve information organization, management and understanding. It has a crucial role to play in enabling content-based access, interoperability, communications, and providing qualitatively new levels of services on the next wave of web transformation in the form of the Semantic Web. The issues pertaining to ontology ge...

Bootstrapping Information Extraction from Semi-structured Web Pages

We consider the problem of extracting structured records from semi-structured web pages with no human supervision required for each target web site. Previous work on this problem has either required significant human effort for each target site or used brittle heuristics to identify semantic data types. Our method only requires annotation for a few pages from a few sites in the target domain. T...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Ontology research and development. Part 1 - a review of ontology generation

Ontology is an important emerging discipline that has the huge potential to improve information organization, management and understanding. It has a crucial role to play in enabling content-based access, interoperability, communications, and providing qualitatively new levels of services on the next generation of Web transformation in the form of the Semantic Web. The issues pertaining to ontol...

Automatic Hidden-Web Table Interpretation by Sibling Page Comparison

The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large volume table conversion, but would also be an aid in solving related problems such as information extraction and semi-structured data management. In this paper, we offer a conceptual modeling solution for the common special case in whi...

Publication date: 2013
